RegEx multiline matching [Solved]

Hi all,

My main question: I have a text file whose content is HTML markup. Because of the way the text is arranged, I should be able to extract some information from it with a regex. The following is the literal content of the file:

<div><b>EmployeeName:</b> Luckas Duckins</div>
<div><b>CCName:</b> Mike McMice</div>
<div><b>CCEmail:</b> MikeMcMice@funinc.com</div>
<div><b>ExpirationDate:</b> 7/17/2015</div>

I have a script that was working last Friday, but when I went back today to keep working on it, I got no match. The script is as follows:

$MyPath = "c:\Path\to\textfile.txt"
$regex99 = @'
(?ms)<div><b>EmployeeName:<\/b> (.+?)</div>
<div><b>CCName:<\/b> (.+?)<\/div>
<div><b>CCEmail:<\/b> (.+?)<\/div>
<div><b>ExpirationDate:<\/b> (.+?)<\/div>
'@

[IO.File]::ReadAllText($MyPath) -match $regex99
if ([IO.File]::ReadAllText($Mypath) -match $regex99)
  {
   $EmployeeName = $matches[1]
   $CCName = $matches[2] 
   $CCEmail = $matches[3] 
   $ExpirtationDate = $matches[4] 
  }
"EmpName"
$EmployeeName 
"CC Name"
$CCName 
"CC Email"
$CCEmail 
"EXP Date"
$ExpirtationDate 

#output was
#True
#EmpName 
#Luckas Duckins
#CC Name
#Mike McMice
#CC Email
#MikeMcMice@funinc.com
#EXP Date
#7/17/2015

At first I just got a big False. I suspect the issue is with the file itself: I resaved the file after adding a new line at the end, and the script worked. Then I removed the new line again, and the script still worked. Each of the following regexes works on its own, but I am trying to get everything in one go.

$regex99 = @'
(?ms)<div><b>EmployeeName:<\/b>\s(.+?)<\/div>
'@

$regex99 = @'
(?ms)<div><b>CCName:<\/b>\s(.+?)<\/div>
'@

$regex99 = @'
(?ms)<div><b>CCEmail:<\/b>\s(.+?)<\/div>
'@

I have used https://mjolinor.wordpress.com/2012/01/05/powershell-multiline-regex-matching/ as a reference, as well as a post I found on Stack Overflow.

Any help is appreciated.

UPDATE:

Found the post on Stackoverflow that I used as reference. http://stackoverflow.com/questions/15375921/powershell-parse-parts-of-a-text-file-and-save-to-csv

UPDATE 2:

I kept working on the script and modified the text file; after resaving the file, the script worked.

Some background about the text file: I get the text content from another script, save it to the text file, and then read the file to process it.

Is it possible to save the text to a variable, and keep the text as a here-string so I can process it?
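
On keeping the text in a variable: a here-string is only a way of writing a multi-line string literal inside a script, so any multi-line string held in a variable can be used with -match in the same way. A minimal sketch, assuming the other script hands over the text as one string (the $text value below is illustrative sample data, not the real feed):

```powershell
# Illustrative sample; in practice this string would come from the other script.
$text = "<div><b>EmployeeName:</b> Luckas Duckins</div>`n<div><b>CCName:</b> Mike McMice</div>"

# A string in a variable can be matched directly; no file or here-string is needed.
# \r?\n accepts either LF or CRLF between the two lines.
$pattern = '<b>EmployeeName:</b> (.+?)</div>\r?\n<div><b>CCName:</b> (.+?)</div>'

if ($text -match $pattern) {
    $EmployeeName = $Matches[1]
    $CCName       = $Matches[2]
}
```

If the text does have to go through a file, [IO.File]::ReadAllText (as in the script above) already returns it as a single string.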


July 20th, 2015 11:55am

Patterns for matching cannot be stored in "here" strings:

$regex99='(?ms)<div><b>EmployeeName:<\/b>\s(.+?)<\/div>'

July 20th, 2015 12:22pm

As shown in the script at https://mjolinor.wordpress.com/2012/01/05/powershell-multiline-regex-matching/, it seems they can. I tested the script from that post myself, and edited it a little to also get the version number from the here-string sample, and it worked. I just wonder where my regex is wrong. I also found the post I used as a reference, http://stackoverflow.com/questions/15375921/powershell-parse-parts-of-a-text-file-and-save-to-csv; there the regex (see http://stackoverflow.com/posts/15382469/revisions) is in a here-string.

July 20th, 2015 12:57pm

I would recommend a line-by-line approach and separate regular expressions based on input line, rather than trying to create a single regex that matches everything.

As the old saying goes, now you have two problems.

The simpler the regex, the better.
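
A rough sketch of that line-by-line idea, assuming every line has the shape <div><b>Name:</b> Value</div> as in the sample. The $fields hashtable and the group names are made up for illustration:

```powershell
# Sample text from the original post; a real file could be read with Get-Content instead.
$text = @'
<div><b>EmployeeName:</b> Luckas Duckins</div>
<div><b>CCName:</b> Mike McMice</div>
<div><b>CCEmail:</b> MikeMcMice@funinc.com</div>
<div><b>ExpirationDate:</b> 7/17/2015</div>
'@

$fields = @{}

# Splitting on \r?\n handles both LF and CRLF line endings.
foreach ($line in ($text -split '\r?\n')) {
    # One small pattern per line: capture the label and its value.
    if ($line -match '<b>(?<name>\w+):</b>\s*(?<value>.*?)\s*</div>') {
        $fields[$Matches['name']] = $Matches['value']
    }
}

$fields['EmployeeName']
$fields['ExpirationDate']
```

Because each line is matched on its own, the pattern never has to describe a line break, so the line-ending question does not come up.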

July 20th, 2015 1:41pm

The problem is that the match will fail if the line terminators in the text are not the ones the pattern expects.  HTML pages may have only a linefeed, both CR and LF, or no line breaks at all.

Mostly I don't understand your question.  You say it works but that you get False.  If you get False then it doesn't work.

We have no idea what is in your file.
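
One way to find out what is actually in the file is to count each kind of line terminator. A small sketch; Get-LineEndingCount is a hypothetical helper name, not a built-in cmdlet, and [pscustomobject] assumes PowerShell 3.0 or later:

```powershell
# Hypothetical helper: counts CRLF, bare LF and bare CR line endings in a string.
function Get-LineEndingCount {
    param([string]$Text)

    [pscustomobject]@{
        CRLF = [regex]::Matches($Text, '\r\n').Count
        LF   = [regex]::Matches($Text, '(?<!\r)\n').Count   # LF not preceded by CR
        CR   = [regex]::Matches($Text, '\r(?!\n)').Count    # CR not followed by LF
    }
}

# Illustrative input with one ending of each kind.
$sample = "a`r`nb`nc`rd"
$counts = Get-LineEndingCount -Text $sample
$counts

# For a real file, something like:
# Get-LineEndingCount -Text ([IO.File]::ReadAllText($MyPath))
```

The same check can be run on the pattern string itself, since a here-string takes its line breaks from however the script file was saved.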

July 20th, 2015 1:49pm

To continue on Bill's line: use multiple patterns and passes, one for each extraction.  That would be most reliable.

PS >$html=@'
<div><b>EmployeeName:</b> Luckas Duckins</div>
<div><b>CCName:</b> Mike McMice</div>
<div><b>CCEmail:</b> MikeMcMice@funinc.com</div>
<div><b>ExpirationDate:</b> 7/17/2015</div>
'@

PS > if($html -match 'EmployeeName:</b>(?<x>.*)</div>') { $matches['x'] }
Luckas Duckins
PS > if($html -match 'CCName:</b>(?<x>.*)</div>') { $matches['x'] }
Mike McMice
PS >
July 20th, 2015 1:54pm

This is crazy. I changed the regex as follows:

$regex = @'
(?ms)<div><b>EmployeeName:<\/b> (.+?)</div>\n<div><b>CCName:<\/b> (.+?)<\/div>\n<div><b>CCEmail:<\/b> (.+?)<\/div>\n<div><b>ExpirationDate:<\/b> (.+?)<\/div>
'@

I had actual new lines in the regex; I have now replaced those with escaped newlines (so the regex is on one line), and the script works without issue. I just want to know why.

Looks like we are all good now. Any ideas on how to convert whichever newline is used to one specific newline?

Thanks. 
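
On converting whichever newline is used to one specific newline, a minimal sketch using -replace. The $mixed string is made-up sample data; CRLF is chosen as the target here only as an example:

```powershell
# Illustrative input containing CRLF, LF and bare CR line breaks.
$mixed = "one`r`ntwo`nthree`rfour"

# \r\n is listed first so a CRLF pair is replaced once, not as two separate breaks.
$normalized = $mixed -replace '\r\n|\r|\n', "`r`n"

# Every line break in $normalized is now CRLF.
$normalized -split "`r`n"
```

To normalize to the current platform's convention instead, [Environment]::NewLine can be used as the replacement string.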

July 20th, 2015 3:14pm

You are using a regex new line \n. Windows uses \r\n as a line terminator.  That causes the match to fail. Unix and most web servers use \n or no line breaks.
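
A pattern can also be written so that it accepts either terminator, which avoids depending on how the text (or the script file holding a here-string) was saved. A sketch, assuming the four lines always appear in this order; the $fragments and $rows names are illustrative:

```powershell
# One pattern fragment per expected line, in order.
$fragments = @(
    '<div><b>EmployeeName:</b> (.+?)</div>'
    '<div><b>CCName:</b> (.+?)</div>'
    '<div><b>CCEmail:</b> (.+?)</div>'
    '<div><b>ExpirationDate:</b> (.+?)</div>'
)

# \r?\n between the fragments matches LF as well as CRLF.
$regex = $fragments -join '\r?\n'

# Build the same sample text with both kinds of line ending.
$rows = @(
    '<div><b>EmployeeName:</b> Luckas Duckins</div>'
    '<div><b>CCName:</b> Mike McMice</div>'
    '<div><b>CCEmail:</b> MikeMcMice@funinc.com</div>'
    '<div><b>ExpirationDate:</b> 7/17/2015</div>'
)
$lfText   = $rows -join "`n"
$crlfText = $rows -join "`r`n"

$lfOk   = $lfText   -match $regex
$crlfOk = $crlfText -match $regex

# $Matches now holds the captures from the CRLF text.
$ExpirationDate = $Matches[4]
```

Since the pattern is assembled from single-line strings, it contains no literal line breaks of its own.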

July 20th, 2015 3:23pm

Just an update. I am not using the text file anymore. However, this still applies. I get the information from the HTML page, and then I run a replace on it to match the newLines.
$feedURL = "http://website.com/feed/getfeed/" #sample url for AtomFeed
#property object
$property = New-Object System.Collections.Specialized.OrderedDictionary
$property.Add('UseDefaultCredentials', $true)

#I get the AtomFeed specific property that I  need
$result = ((New-Object Net.Webclient -Property $property ).DownloadString($feedURL) -as [xml]).rss.channel.item[0].description.InnerText 

#I normalize each line break to the newline that matches my OS (Windows);
#matching \r?\n avoids doubling the `r when the text already uses `r`n
$result = $result -replace '\r?\n', "`r`n"

#Then I run the former script
$regex = @'
(?ms)<div><b>EmployeeName:<\/b> (.+?)</div>
<div><b>CCName:<\/b> (.+?)<\/div>
<div><b>CCEmail:<\/b> (.+?)<\/div>
<div><b>ExpirationDate:<\/b> (.+?)<\/div>
'@

#reset variables
$EmployeeName, $CCName, $CCEmail, $ExpirtationDate = $null
#check if there are matches
$result -match $regex
#get the values I want
if ($result -match $regex)
  {
   $EmployeeName = $matches[1]
   $CCName = $matches[2] 
   $CCEmail = $matches[3] 
   $ExpirtationDate = $matches[4] 
  }

$EmployeeName 
$CCName 
$CCEmail 
$ExpirtationDate 
That works for me.
July 20th, 2015 4:24pm

RSS and Atom feeds can be converted directly to XML, as they are XML.  If you use a webclient then formatting is applied and the XML is lost.

Just grab the XML and it will be easy to query with XPath.

July 20th, 2015 4:38pm

This feed has both iTunes and atom namespaces.

$feed=Invoke-WebRequest 'http://www.sciencefriday.com/audio/scifriaudio.xml'
$xml=[xml]$feed.Content
$xml.rss.channel
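
Once the feed is loaded as XML, it can be queried with XPath through the XmlDocument methods. A minimal sketch against an inline, made-up RSS document (the element values are placeholders, and a real feed with namespaced elements would additionally need an XmlNamespaceManager):

```powershell
# Made-up RSS document standing in for a downloaded feed.
[xml]$xml = @'
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First item</title>
      <description>First description</description>
    </item>
    <item>
      <title>Second item</title>
      <description>Second description</description>
    </item>
  </channel>
</rss>
'@

# All item titles, selected with an XPath expression.
$titles = $xml.SelectNodes('/rss/channel/item/title') | ForEach-Object { $_.InnerText }

# Description of the first item only (XPath positions are 1-based).
$first = $xml.SelectSingleNode('/rss/channel/item[1]/description').InnerText

$titles
$first
```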

July 20th, 2015 4:45pm
